Statistical unicodification of African languages
Abstract
Many languages in Africa are written using Latin-based scripts, but often with extra diacritics (e.g. dots below in Igbo: ị, ọ, ụ) or modifications to the letters themselves (e.g. open vowels “e” and “o” in Lingala: ɛ, ɔ). While it is possible to render these characters accurately in Unicode, keyboard input methods are often not easily accessible or are cumbersome to use, and so the vast majority of electronic texts in many African languages are written in plain ASCII. We call the process of converting an ASCII text to its proper Unicode form unicodification. This paper describes an open-source package which performs automatic unicodification, implementing a variant of an algorithm described in previous work by De Pauw, Wagacha, and de Schryver. We have trained models for more than 100 languages using web data, and have evaluated each language using a range of feature sets.
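To make the task concrete, here is a minimal sketch of unicodification as a word-level lookup baseline. It is not the algorithm of De Pauw, Wagacha, and de Schryver, nor the package described in the paper; the function names (`asciify`, `train`, `unicodify`) and the `EXTRA` mapping are hypothetical and chosen only for illustration. The idea is to strip a Unicode training corpus down to ASCII, remember the most frequent Unicode form seen for each ASCII form, and use that table to restore new ASCII text.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumption: a small hand-written table for modified letters that do not
# decompose into "base letter + combining mark" under NFD normalization.
EXTRA = {"\u025b": "e", "\u0254": "o", "\u0190": "E", "\u0186": "O"}


def asciify(text):
    """Remove combining diacritics and map modified letters to ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTRA.get(ch, ch) for ch in stripped)


def train(corpus_words):
    """Map each ASCII form to the most frequent Unicode form seen for it."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        counts[asciify(word)][word] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def unicodify(text, model):
    """Restore each whitespace-separated token; unknown tokens pass through."""
    return " ".join(model.get(token, token) for token in text.split())
```

A model would be built with `train(words)` from correctly written Unicode text and applied with `unicodify(ascii_text, model)`. A lookup like this cannot restore words it has never seen, which is one reason to consider character-level features instead.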
Journal: Language Resources and Evaluation
Volume: 45
Publication date: 2011